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Description 

[0001] The present invention relates to character recognition techniques that categorize strings, such as strings of 
characters or other elements. 

5 [0002] Some conventional character recognition techniques improve recognition accuracy by taking into account 
different categories of character strings that can occur in text. The character recognition products of Xerox Imaging 
Systems, for example, employ several specialized recognition algorithms, each for a respective category of character 
string. One algorithm can look up words in a dictionary, another can recognize valid strings of arabic numerals with 
punctuation marks, and so forth. If a string of character candidates includes a candidate that has asubstantial probability 

10 for more than one character category, the ambiguity can be resolved by applying each recognition algorithm to the 
possible string that would result from each probable character category. If one of the possible strings is recognized by 
one of the algorithms, the recognition result can be used to resolve the ambiguity. If two or more possible strings are 
recognized by different algorithms, the sequence of categories of strings can be taken into account to resolve the 
ambiguity. 

15 [0003] US-A-4,003,022, describe string pattern recognition techniques. As shown and described in relation to Figs. 
1-4 and 7, a character is decomposed into a symbolic string pattern that is fed to sequential logic. At column 3 lines 
52-55 it is stated that the class for each symbol string pattern is determined by comparing it with all sequential logics 
and checking which has accepted the pattern. 

[0004] Sinha, R.M.K. and Prasada, B., "Visual Text Recognition through Contextual Processing," Pattern Recogn'h 
20 tion, Vol. 21 , No. 5 f 1 988, pp. 463-479, describe visual text recognition techniques. Dictionary methods are described 
beginning in the last paragraph on page 463. Section 2.1 on page 465 describes a partial dictionary. Section 2.6 on 
page 467 describes word boundary identification based on unambiguous punctuation marks. Section 2.7 on pages 
467-468 describes heuristics use to identify word boundaries, match with the dictionary, and traverse through a modified 
Viterbi net. Section 3 on page 468 and Figs. 3 and 4 describe dictionary organization and search, and relate specifically 
25 to a trie structure based dictionary. Fig. 3 shows a node structure, including a NEXT field, an ALT field, a character, an 
end of word mark, and a flag. Page 473, right hand column, notes that errors are encountered due to several causes, 
including ambiguous punctuation marks within a word and mapping of numbers to a dictionary word. 
[0005] Srihari, S.N., Hull, J.J., and Choudhari, R., "Integrating Diverse Knowledge Sources in Text Recognition," 
ACM Transactions on Office Information Systems, Vol. 1, No. 1, January 1983, pp. 68-87, describe text recognition 
30 techniques. Section 3 on pages 72-74 and Fig. 2 describe lexical organization, relating specifically to a letter trie. Fig. 
2(a) shows the fields of a node, including: a token, CHAR; a word-length indicator array of bits, WL; an end-of-word 
tag bit, E; two pointers labeled NEXT and ALTERNATE. 

[0006] WO-A-8301329 discloses items of data indicating a set of words, including stop word table 301 0 and entry 
(candidate) words 3022, but nothing in this document suggests that these items of data include sequences of data 
35 units that begin with character data unit subsequences, have acceptance data and have ending sequences with cat- 
egory data units. 

[0007] EP-A-0282721 describes dictionaries with paradigm numbers. The paradigm numbers provide information 
about part of speech, inflectional forms, and grammatical role. But this document does not teach or suggest sequences 
of data units that begin with character data unit subsequences, have acceptance data, and have ending subsequences 
with category data units. 

[0008] Revuz D, Dictionaries and Lexicons, Methods and Algorithms a doctoral theses published 1 991 and locatable 
of the Institute Blaze Pascal in Paris, France under the index LITP 91 .44, describes a representation of an automaton 
in which a state includes an indicator of whether it is final or not and a pointer to a list of its outgoing transitions. As 
shown in Fig. 10, each transition includes a character, a pointer to the next transition in the list, and a pointer to the 
45 state to which the transition leads. Section 3.6a describes an automaton with multiple terminal states, with each of 
which is associated one of a set of 8400 DELAF codes, and section 1 .2.2 indicates that the DELAF codes indicate the 
endings a simple word may take. 

[0009] Revuz discloses that a D ELAF code may be associated with a terminal state, but does not disclose a technique 
in which an ending subsequence for an acceptable string includes two category data units. 
so [0010] The present invention provides a product for providing information about strings of characters according to 
claim 1 . 

[0011] The present invention further provides a system according to claim 11 of the appended claims. 
[0012] One aspect of the invention deals with basic problems in conventional character recognition techniques that 
take character string category into account. 
55 [0013] As described above, some conventional techniques handle each character string category with a separate 
algorithm. This approach is time inefficient because each category's algorithm must be performed for each string of 
character candidates. Also, this approach is space inefficient because the data necessary to perform each category's 
algorithm, including instructions, must be stored separately. 
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[0014] This aspect is further based on the discovery of a technique that alleviates these basic problems. The tech- 
nique integrates the algorithms for all categories of strings so that only one algorithm is performed for each string of 
character candidates. The technique also integrates the data for each category of string, so that the data used by the 
integrated algorithm is very compact. 

5 [0015] This discovery is based on the observation that directed graph techniques, and specifically directed graphs 
that represent finite state machines, make it possible to integrate diverse string recognition algorithms. The acceptable 
categories of strings vary significantly, but each category can be expressed as a respective finite state machine (FSM). 
Even categories that include infinitely many possible strings, such as the arable numbers, can be expressed as cyclic 
FSMs. Therefore, the same lookup algorithm could be used with a directed graph representing each category's FSM. 

10 To realize space efficiency, the FSMs of all the categories can be composed into a combined FSM, so that only one 
directed graph is necessary to represent the combined FSM. To realize time efficiency, the lookup algorithm is only 
executed once in recognizing a string of character candidates. To resolve a string that is acceptable as more than one 
of the categories, the directed graph includes data indicating one or more categories for each acceptable string. 
[0016] String data according to the invention thus includes a sequence of data units that can be accessed using data 

15 indicating characters of a string. If the string is acceptable, the sequence of data units ends with a subsequence that 
includes information indicating that the string of characters is acceptable and information indicating a category set for 
the string. The processor of a character recognition system can therefore use the ending subsequence to obtain ending 
data indicating that the string is acceptable and indicating the string's category set. 

[0017] The technique described above is especially advantageous for data structured so that both prefixes and suf- 
20 fixes of strings can be collapsed to eliminate redundancy: Because the category indicators are relatively few in number, 
the data would include many occurrences of each category indicator if there were a category indicator for each ac- 
ceptable string. But if the category indicator of each string is stored at the end of the string's suffix, collapsing suffixes 
greatly reduces the number of category indicators in the data. Suffixes can be collapsed with well-known minimization 
algorithms applicable to directed graphs or other data representing FSMs. 
25 [0018] String data that includes category indicators as described above can be stored by a data storage medium 
such as a floppy disk. The data storage medium can also include control data, such as instructions, that a processor 
can use in accessing the string data. The processor of a system performing character recognition can therefore use 
the control data and the string data to obtain data indicating whether a string of characters is an acceptable string and 
indicating a set of categories for the string if it is acceptable. 
30 [0019] Embodiments of the invention will now be described, by way of example, with reference to the accompanying 
drawings, in which: 

Fig. 1 is a schematic block diagram showing components of a system in which a processor can use string data to 
obtain category set data for a string; 
35 Fig. 2 is a flow chart showing general steps by which the processor in Fig. 1 can match a character using the string 

data; 

Fig. 3 is a flow chart showing general steps by which the processor in Fig. 1 can obtain ending data using the 
string data; 

Fig. 4 is a schematic diagram of a character subsequence showing information it includes and its links to other 
40 subsequences; 

Fig. 5 is a flow chart showing steps in accessing subsequences like that in Fig. 4 in matching a character; 

Fig. 6 is a flow chart showing steps in accessing a subsequence like that in Fig. 4 in obtaining ending data indicating 

whether a string is acceptable and its category set; 

Fig. 7 is a schematic drawing showing a character subsequence of data units; 
45 Fig. 8 is a flowchart showing more detailed steps for implementing the technique of Fig. 5; 

Fig. 9 is a flowchart showing more detailed steps for implementing the technique of Fig, 6; 

Fig. 1 0 is a schematic drawing of an N-bit vector for returning category set data; 

Fig. 11 is a schematic block diagram showing components of a system using string data for recognition; 

Fig. 12 is a schematic drawing of a software product including string data. 

50 

A. General Features 

[0020] Figs. 1 -2 illustrate general features of the invention. Fig. 1 illustrates general features of string data that can 
be accessed to obtain acceptance data and category set data. Fig. 2 illustrates general steps in using string data tike 
55 that shown in Fig. 1 . 

[0021 ] Fig. 1 shows system 1 0, in which string data 1 2 can be used. String data 1 2 can, for example, be stored by 
a data storage medium accessible by processor 14; for example, string data 12 could be stored in a software product 
such as a floppy disk, CD-ROM, or the like. Processor 14 can also receive character data 1 6 indicating characters of 
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a string. Processor 14 can provide acceptance data 18 indicating that the string of characters is acceptable and also 
indicating a set of categories for the string. 

[0022] Fig. 1 shows sequence of data units 30 within string data 12. If character data 16 indicates the characters 
"mix," for example, sequence 30 may include subsequence 32 indicating the character "m," subsequence 34 indicating 
5 the character "i " and subsequence 36 indicating the character "x." Each subsequence can include one or more data 
units. 

[0023] A subsequence can include other information in addition to information indicating a character. For example, 
subsequence 36, in addition to information indicating the character "x," also includes information indicating that the 
string "mix" is acceptable and that its categories include word and roman numeral. This information can be used by 
io processor 14 if it reaches the end of the string "mix" while accessing subsequence 36. Processor 14 can use this 
information to obtain acceptance data indicating that "mix" is an acceptable string and to obtain category set data 
indicating a set of categories for "mix" that includes word and roman numeral. 

[0024] Fig. 2 shows steps processor 1 4 could follow in using string data 1 2 to match a character. The step in box 50 
begins by receiving a request to match a character and an entry location. This entry location could be a location rep- 

15 resenting a start state of an FSM or it could be another location within string data 12 from which data units can be 
accessed that indicate a set of alternative characters that can follow the entry location. For example, it could be the 
location of a character's subsequence of data units. The data units that can be accessed from the entry location can 
include a respective subsequence of data units for each of the alternative following characters. 
[0025] The step in box 52 accesses the entry location in string data 12. The step in box 60 then accesses the data 

20 units that indicate alternative following characters until it determines whether the character matches one of the char- 
acters that can follow the entry location. If so, the step in box 62 returns the location of the matching character's 
subsequence, but if the character does not match any of the alternative following characters, the step in box 64 returns 
data indicating that the string being matched is not acceptable. 

[0026] Fig. 3 shows steps processor 14 could follow in using string data 16 to obtain ending data for a string. The 
25 step in box 70 begins by receiving a request for ending data and an ending location. The step in box 72 accesses the 
ending location in string data 12. The step in box 80 determines whether the string that ends at the ending location is 
an acceptable string. If so, the step in box 82 returns ending data indicating the string is acceptable and also indicating 
a set of categories for the string. If the string is not acceptable, the step in box 84 returns data indicating it is not 
acceptable. The steps in boxes 80, 82, and 84 can be performed using acceptance information and category set 
30 information from an ending subsequence, such as ending subsequence 36 in Fig. 1 . 

[0027] The steps in Figs. 2 and 3 can be used together in character recognition. For example, processor 14 could 
perform the steps in Fig. 2 for each character of a candidate string. If any character cannot be matched starting at the 
location at which the preceding character was matched, processor 14 could obtain data indicating the string is not 
acceptable. Upon matching the last character of the string, processor 1 4 could perform the steps in Fig. 3 at the location 
35 where the last character was matched to determine whether the string is acceptable and to obtain data indicating its 
category set if it is acceptable. 

B. Implementation 

40 [0028] The general features described above could be implemented in numerous ways on various machines. The 
invention has been implemented in the Medley version of the Lisp programming language available from Venue Cor- 
poration for execution on a Sun SparcStation and in the C programming language for execution on a Sun SparcStation, 
a Macintosh personal computer, and other machines. 

45 1 . Character Subsequences 

[0029] Figs. 4-6 illustrate high level features of string data that can be used in implementing the invention. Fig. 4 
shows a character subsequence and its relationship to other subsequences. Fig. 5 shows steps in accessing character 
subsequences in response to a request to match a character. Fig. 6 shows steps in accessing a character subsequence 

so in response to a request for ending data. 

[0030] Character subsequence 100 in Fig. 4 includes several types of information. Label information 102 indicates 
a character type. Next subsequence information 104 indicates whether subsequence 100 has a next subsequence 
and, if so, the location of the next subsequence. Alt subsequence information 106 indicates whether subsequence 100 
has an alternative subsequence and, if so, the location of the alternative subsequence. Ending information 108 indicates 

55 whether a string that ends with a character matching label information 1 02 is an acceptable string and, if so, a category 
set for the string. 

[0031 ] An interlinked set of character subsequences like that shown in Fig. 4 can be linked together to represent an 
FSM. As shown, character subsequence 100 could be the next or alternative subsequence for one or more other 



4 



EP 0 638 187 B1 



subsequences. 

[0032] Fig. 5 generally follows the steps in Fig. 2. The step in box 120 begins upon receiving a call to a function 
designated "Advance," which is called with a location and a character type. The step in box 122 uses the location to 
access a character subsequence. The step in box 122 also uses the subsequence's next information to obtain data 
5 indicating the location of the subsequence's next subsequence. The step in box 124 uses the data indicating the next 
subsequence's location to access it. 

[0033] The step in box 126 begins an Iterative loop that accesses the next subsequence and its alternative subse- 
quences, if any, to find a match. The step in box 126 uses the current subsequence's label information to obtain data 
indicating a character type. The step in box 130 branches based on whether this character type is the same as the 

10 character type received in box 1 20. If so, the step in box 1 32 returns data indicating the current subsequence's location. 
But if not, the step in box 134 uses the current subsequence's alt information to obtain data indicating whether the 
subsequence has an alternative subsequence and, if so, the location of the alternative subsequence. 
[0034] If the current subsequence does not have an alternative subsequence, the branch in box 1 40 leads to the 
step in box 142, which returns data indicating that the character received in box 120 cannot be matched. But if the 

15 subsequence has an alternative subsequence, the step in box 144 uses the data indicating the alternative subse- 
quence's location to access it before returning to the step in box 126. 

[0035] Fig. 6 generally follows the steps in Fig. 3. The step in box 170 begins upon receiving a call to a function 
designated "End Status," which is called with a location. The step in box 1 72 uses the location to access a character 
subsequence. The step in box 172 also uses the subsequence's ending information to obtain data indicating whether 
20 the subsequence is the ending subsequence of an acceptable string and, if so, a set of categories for the string. 

[0036] If the subsequence is not the ending subsequence of an acceptable string, the branch in box 180 leads to 
the step in box 1 82, which returns data indicating that the string is not acceptable. But if the subsequence is the ending 
subsequence of an acceptable string, the step in box 184 returns data indicating the set of categories for the string, 
data which also implicitly indicates that the string is acceptable. 

25 

2. Encoding and Decoding 

[0037] Character subsequences could be encoded and decoded in many ways in implementing the invention. Table 
I shows one way a byte value space can be allocated to permit use of some byte values both as character bytes and 
30 as category bytes. Fig. 7 shows a character subsequence in which each data unit is a byte value from the allocation 
of Table I. Fig. 8 shows how steps in Fig. 5 could be implemented for the allocation of Table I. Fig. 9 shows steps for 
obtaining category set data. Fig. 10 shows a bit vector for returning category set data. 

[0038] The byte values in Table I range from 0 to 255, and hence are eight bits in length. N is the number of distinct 
six-bit labels, explained in more detail below. P 1 is the number of long pointer byte codes, and P 2 is the number of 

35 midsize pointer byte codes. As can be seen in Table I, the number of short pointer byte codes is 255 - (4N + P 1 + P 2 ). 
[0039] As a result of the EOB byte, each character has only four different types, the character's four possible com- 
binations of final, also referred to herein as acceptable, and alternative, also referred to herein as alt. Having only four 
types of characters is advantageous because the type of each character can be represented by two bits, which can 
be masked for comparison, avoiding the need to do a table lookup or other complicated decoding of a byte value. 

40 Therefore, rather than allocating blocks as in the FSM encoding application, Table I allocates four consecutive byte 
values to each of N - 2 characters. 



TABLE I 



Byte Values 


Interpretation 


0 


EOB Byte 


1 to 255 - (4N + P, + P 2 ) 


Short Pointer 


256 - (4N + P A +P 2 ) to 255 - (4N + P,) 


Midsize Pointer 


256 - (4N + P t ) to 255 - 4N 


Long Pointer 


256 -4N to 259 -4N 


Tagmark or Category Types l-i V 


260 -4N to 251 


Characters or Categories Types l-IV 


252-255 


Table Label Types l-IV 



[0040] Table I also allocates four consecutive byte values to a tagmark, which is not interpreted as a character but 
rather indicates that the preceding character is the end of an acceptable string. The bytes following a tagmark can thus 
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be interpreted as something other than characters, and Table I shows that the tagmark byte value and the character 
byte values can also be interpreted as category byte values. This interpretation is applied to a byte value in the tagmark 
and character/category ranges if it is accessed in a sequence of data units after accessing a tagmark. 
[0041 ] Table I also allocates four consecutive byte values to a table label, which is interpreted neither as a character 

s nor as a category, but rather indicates that the next bytes are a pointer to a table. Table labels are an optional technique 
that can be used to reduce the time necessary to search the string data representing an FSM in cases where a state 
in the FSM has a large number of outgoing transitions. Table labels have been found to be advantageous where an 
FSM transducer summarizes general rules, where an FSM permits compounding of acceptable strings, or where an 
FSM is used to classify strings that may be composites of acceptable strings and other elements such as punctuation 

10 marks. States in such situations resemble the FSM's start state, which has a large number of outgoing transitions. 
[0042] Each table can be implemented with N - 1 entries, so that there is one entry for each of the label values, from 
the tagmark byte through the character/category bytes; because the table label has the highest label value, the table 
need not have an entry for the table label itself. The entries are of equal length, so that the entry for a given label value 
can be accessed by multiplying the six-bit label value by the entry length to obtain an offset into the table. 

15 [0043] Each entry begins with a label byte. If a label value can be matched from the location preceding the table 
label, the label value's entry has a label byte with the same label value. But if the label value cannot be matched from 
the location preceding the table label, the label value's entry has a label byte with a different label value; for example, 
a table label byte could be used as the label byte for each entry that cannot be matched, since its label value is different 
than the label value of any of the entries. 

20 [0044] Each entry also includes a pointer field following the label byte, and the pointer field is large enough to hold 
a full length pointer. In the current implementation, each full length pointer is three bytes, so that the entry length is 
four bytes. If a label value can be matched from the location preceding the table label, the pointer field of the label 
value's table entry contains a pointer to the next subsequence for the label or the special EOB byte if there is no next 
subsequence. 

25 [0045] A table as described above can be shared by a state for which some of the table's entries are inappropriate. 
This can be done if the outgoing transitions for each state are ordered so that the table label is always the last transition. 
Preceding outgoing transitions can be used to provide information for transitions for which the table entry is inappro- 
priate. 

[0046] In general, the choice of which states should be followed by table labels involves a complex time-space tradeoff 
so that may depend heavily on the specif ic data and task. The use of more tables improves search speed, but may increase 
the space needed to store the string data, which include the tables. In certain situations, such as states representing 
alternative terminal punctuation marks, tables produce dramatic speed improvement, and also reduce the overall space 
because a large number of states may share a given table. 

[0047] Like the character/category byte values, the two low order bits of the tagmark byte value can be used to 
35 indicate whether it has an alternative data unit. The bit indicating whether a tagmark is the end of an acceptable string 
does not matter, since the tagmark itself is never the end of an acceptable string, but always follows a character that 
is the end of an acceptable string. The first byte value for the tagmark, referred to herein as "CharBase," is 256 - 4N, 
an integer multiple of 4. Therefore, the two low order bits of each character byte can be masked to obtain six high order 
bits identifying the character. 

40 [0048] The two low order bits of the table label can similarly be used to provide useful information, if advantageous. 
For example, they could be used to indicate the length of the pointer to the table. 

[0049] The character/category bytes can include an escape code, as described in the mapping application in relation 
to escape codes. The escape code plus another byte that follows it in the string data can be used to encode characters 
if the total number of characters is greater than N - 2, which is likely to be the case if N is chosen to optimize the number 
45 of pointers of each size. If escape codes are used for all characters that appear at ends of acceptable strings, such as 
terminal punctuation marks, the terminal punctuation marks may all be collapsed to a single state that also has a single 
transition back to the start state for a hyphen or other characters that may join the parts of a compound word. But it 
may be desirable to use an escape code for other characters, so that the choice of when to use escape codes is also 
a complex tradeoff. 

so [0050] A wide variety of alternative encodings could be used if advantageous for particular data or for a particular 
task. For example, the special EOB byte could be omitted if all branches end with a category byte and if every category 
byte is an end of branch. It might also be possible to use a single bit to indicate whether a character byte or a category 
byte has an alternative, if a character byte that is final is always followed by a tagmark byte and if a category byte is 
always final. These approaches could, for example, be used to increase the number of short pointers. The tagmark 

55 byte could also be eliminated if distinct blocks of values were allocated to characters and to categories, because the 
category bytes following a final character byte could be positioned as the first outgoing transitions-presence of a 
category byte following a character byte would indicate that the character byte is at the end of an acceptable string 
and that the category byte is at the beginning of a set of category bytes for the string. 
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[0051] The tagmark byte or its equivalent in other encodings can be thought of as a special transition that follows 
every final state in an FSM. Following a tagmark transition are a set of bytes that are decoded as categories of strings 
applicable to all the strings that ended at the final state. Examples of categories include main lexicon, user lexicon, 
roman numeral, arabic numeral, phone number, date, etc. 

s [0052] Fig. 7 shows a number of data units within sequence of data units 200. Data units 202, 204, 206, 208, 210, 
and 21 2 together form a character subsequence that includes the same information as subsequence 36 in Fig. 1. Data 
unit 220 is at the beginning of the subsequence's next character subsequence and data unit 222 is at the beginning 
of the subsequence's alternative character subsequence. Data unit 204, the tagmark byte, is illustratively shown as 
the next byte of data unit 202, so that the data units of the character subsequence are contiguous, but data unit 220 

w could instead be the next byte and data unit 204 could be an alternative, in which case the data units of the character 
would not be contiguous as shown. 

[0053] The data units within sequence 200 illustrate all the ranges of byte values in Table I except pointers. Data 
units 202, 220, and 222 are character bytes. Data unit 204 is a tagmark byte. Data units 206 and 210 are category 
bytes. And data units 208 and 212 are EOB bytes. 

15 [0054] Each of the character bytes includes two low order bits as shown, the first indicating whether the byte is at 
the end of an acceptable string and the second indicating whether the byte has an alternative. In the tagmark byte, the 
first low order bit indicates that the byte is not at the end of an acceptable string and the second indicates whether the 
byte has an alternative. In each of the category bytes, the first low order bit indicates that the byte is at the end of an 
acceptable string and the second indicates whether the byte has an alternative. 

20 [0055] The steps in Fig. 8 can replace the steps following box 120 in Fig. 5 if byte values are allocated as in Table 
I. The step in box 240 increments the location received in box 120 in Fig. 5 and retrieves the byte value at the incre- 
mented location. The step in box 242 branches based on whether the retrieved byte value is less than Charbase, 
meaning that it is a pointer byte value. If so, the step in box 244 goes to the location indicated by the pointer and 
retrieves the byte at that location. 

25 [0056] When a byte is retrieved that is not a pointer, the step in box 250 subtracts Charbase from the byte's value. 
If the character received in box 120 has previously been mapped into the range of character byte values by multiplying 
by four and adding Charbase, it is not necessary to subtract Charbase. The step in box 250 also masks the two low 
order bits, such as by logically ANDing the byte with the byte 11 111100, producing the byte's label. 
[0057] The step in box 252 compares the label from box 250 with the character received in box 1 20. The step in box 

30 254 branches based on the result of the comparison. If the label and character match, the step in box 132 in Fig. 5 is 
performed, returning the current location. 

[0058] If the label and character do not match, the step in box 260 determines whether the label is the table label. 
If not, there may be further alternatives that should be considered. The step in box 262 obtains the current location's 
alt bit, which could be saved in the step in box 250. Then the step in box 264 branches based on whether the current 
35 location has an alternative. If not, the step in box 142 in Fig. 5 is performed, returning data indicating the character 
received in box 120 could not be matched. 

[0059] If the current location has an alternative, the step in box 266 goes to the alternative's location and retrieves 
the byte at that location. The alternative can be found by counting alt bits and EOB bytes in the same manner as alts 
and EOBs are counted in the FSM encoding application and the mapping application. Then the step in box 270 branches 
40 based on whether the byte is a pointer, as in box 242. If not, the alternative has been reached so that the step in box 
250 can be performed again . If the byte is a pointer, the byte at the location it points to is retrieved as in box 244 before 
performing the step in box 250. 

[0060] When the table label is found in box 260, the step in box 280 uses the character received in box 1 20 to access 
the appropriate table entry. As noted above, this can be done by multiplying the character by four and adding the result 
45 to the pointer that follows the table label. Then, the test in box 282 branches based on whether the character matches 
the label byte of the table entry. If the character does not match the label byte, the step in box 1 42 in Fig. 5 is performed, 
returning data indicating the character received in box 120 could not be matched. But If the character matches, the 
location of the table entry is returned in box 282, which has the same effect as returning the current subsequence's 
location in box 132 in Fig. 5. 

so [0061 ] The steps in Fig. 8 can be performed beginning at the location of data unit 202 in Fig. 7. If the next character 
after the character V is a %" for example, the first iteration of the step in box 254 finds a mismatch with the tagmark 
label in data unit 204. The step in box 266 counts alt bits and EOB bytes in order to reach data unit 220, and the second 
iteration of the step in box 254 finds a match with the T label, so that the location of data unit 220 is returned in the 
step in box 132 in Fig. 5. 

55 [0062] Fig. 9 shows steps in obtaining category set data. The steps in Fig. 9 can replace boxes 1 72, 1 80, and 1 84 
in Fig. 6 if byte values are allocated as in Table I. 

[0063] The step in box 300 retrieves the byte at the location received in box 1 70 in Fig. 6. The step in box 302 then 
branches based on whether the byte value is at the end of an acceptable string, as indicated by its final bit. If not, the 
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current character subsequence is not the ending subsequence of an acceptable string, so no match is returned in box 
182. If the byte value is at the end of an acceptable string, the step in box 304 obtains the location of the tagmark byte 
by calling Advance (Loc, Tagmark). The steps in Figs. 5 and 8 are performed, returning the tagmark's location. 
[0064] The step in box 310 goes to and retrieves the next byte of the location returned by Advance. This step is the 
same as the steps in boxes 240, 242, and 244 in Fig. 8. Then the step in box 312 saves data indicating the six high 
order bits of the retrieved byte, because those bits indicate one of the categories for the acceptable string. Then, the 
step in box 320 branches based on the retrieved byte's alt bit. If the alt bit is on, the step in box 322 goes to and retrieves 
the byte's alternative byte as described above in relation to boxes 266, 270, and 272 in Fig. 8. Then the step in box 
312 is performed again. If the alt bit is off, the step in box 324 returns category set data based on the data saved in 
box 312. 

[0065] The steps in Fig. 9 can be performed beginning at the location of data unit 202 in Fig. 7 after matching the 
character B x. n The category set data returned in box 324 indicates the categories "word" and "roman numeral." 
[0066] Fig. 10 shows N-bit vector 350 that can be used to save byte values in box 312 and to return category set 
data in box 324. The step in box 312 can remove the final and alt bits from the byte value and use the higher order 
bits to indicate which bit in vector 350 should be set to save the byte value. When alt the category bytes have been 
handled in box 312, vector 350 can be returned in box 324 as the category set data. Vector 350 with all zeroes can 
also be used to return data indicating that a string is not acceptable in box 1 82 in Fig. 6, since at least one bit must be 
set if a string is acceptable. 

[0067] A more compact data structure can sometimes be obtained by systematically introducing epsilon transitions 
into an FSM to facilitate sharing of the trailing portions of sets of outgoing transitions. In other words, every state in 
the FSM can be expanded into a series of binary states, all but the last of which have an epsilon transition to the next 
binary state in the series. This epsi Ion-sharing technique also includes minimizing the expanded FSM, after which extra 
epsilon transitions are removed. 

[0068] If the epsilon-sharing technique is used, the order in which the category bytes are arranged following a tagmark 
byte can be chosen to improve collapsing on the right. For example, the most frequently occurring category can be 
the last in each set of category bytes, the least frequent first, and so forth. In the allocation of Table I, it may be advan- 
tageous to assign the most frequently occurring category to the same byte value as the most frequently occurring 
character for additional collapsing. In general, the objective is to strengthen the frequency skew to obtain better data 
compression. 

3. Product 

[0069] Encoded string data as described above can be used in a recognition system as illustrated in Fig. 11 . Fig. 12 
shows a software product that can be used in such a system. 
35 [0070] In Fig. 11 , processor 370 is connected to receive data defining images from image source 372, which could 
for example be a scanner. Processor 370 is also connected to provide data indicating recognition results to recognition 
output 374, which could for example be a display. Processor 370 is also connected to load software from software 
input device 376, which could be a floppy disk drive, a CD-ROM player, modem, network link, or other peripheral device 
that can access software on a data storage medium or receive software transmitted through a communication channel. 
40 [0071 ] I n performing recognition , processor 370 accesses instruction memory 380 and executes recognition instruc- 
tions 382, which include calls to string data access instructions 384. Data memory 390 stores data for access by 
processor 370. Processor 370 accesses string data 392 in executing string data access instructions 384 and accesses 
image data 394 in executing recognition instructions 382. 

[0072] Software product 400 in Fig. 1 2 includes data storage medium 402, which could be a floppy disk, a CD-ROM, 
45 magnetic tape, or another medium that can store data. Medium 402 stores string data 404 and instructions that can 
be executed in relation to string data 404. 

[0073] In the implementation shown in Fig. 12, the instructions include load instructions 41 0, starting location instruc- 
tions 41 2, advance instructions 41 4, end status instructions 41 6, and free instructions 41 8. 

[0074] Load instructions 410 can include a routine that processor 370 executes to load string data 404 into data 
so memory 390. This routine returns the memory location of a header that includes one or more offsets for accessing start 
states within string data 404. Processor 370 may also load the instructions on medium 402 into instruction memory 
380 before execution. 

[0075] Starting location instructions 412 can include a routine that processor 370 executes to obtain data indicating 
the starting location for accessing string data 404 within data memory 390. This routine can be called with the memory 
55 location of the header returned by load instructions 41 0 and with a start state identifier. The routine then accesses the 
header, obtains the offset to the identified start state, uses the offset to obtain the start state's access location in memory, 
and returns the access location. For consistency with other instructions, the access location for a start state is the 
location whose next subsequence is to be matched in response to the initial character of a string. 
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[0076] Advance instructions 414 can include a routine that processor 370 executes to advance within string data 
404, given a location and a character. In other words, the routine can perform steps like those in Fig. 2. 
[0077] End status instructions 41 6 can include a routine that processor 370 executes to obtain data indicating whether 
a given location is at the end of an acceptable string and, if so, a category set of the string. In other words, the routine 
can perform steps like those in Fig. 3. 

[0078] Free instructions 418 can include a routine that processor 370 executes to flush string data 404 from data 
memory 390. 

[0079] Recognition instructions 382 in Fig. 1 1 can include a routine that calls the instructions loaded from data storage 
medium 402. Because of the simplicity of the functions performed by the instructions shown on medium 402, the 
interface to other software is relatively simple, and software product 400 could be used in a variety of recognition 
techniques. 

[0080] String data 404 can be produced using techniques described in the FSM encoding application and in the spell 
checking application. The spell checking application describes how an FSM data structure can be obtained. 
[0081 ] In the current implementation, preparation of variations of string data 404 is facilitated by providing a special- 
ized user interface to FSM creation routines. The user can define a category of character strings either with the name 
of a file that includes a list of strings or with a definition in the form of a special grammar. The grammar can be used, 
for example, to define a category that has an infinite number of acceptable strings, such as the arable numbers, arith- 
metic expressions, or currency amounts; to define a category in which the acceptable strings follow algorithmic rules, 
such as the roman numbers, phone numbers, or social security numbers; or to extend a category such as a list of 
strings by adding variations that include punctuation or hyphenation, that combine strings to form compound strings, 
or that include reasonable patterns of characters similar to those occurring in the strings in the list. 
[0082] In the current implementation, a user can provide regular expressions that describe acceptable strings in 
various ways. For example, an expression can indicate that a string is acceptable if It includes, in sequence, a substring 
of beginning punctuation, an optional arabic numeral, a substring of separators, an acceptable string from a lexicon, 
another substring of separators, another acceptable string from the lexicon, and so forth, and ends with a substring of 
ending punctuation. Or, an expression can indicate that a string is acceptable if it includes, in sequence, a number 
from 1-9, a substring of from zero to two numbers from 0-9, and then either ends or continues with any number of 
substrings that include a comma followed by three numbers from 0-9. 

[0083] The definition of each category is used to produce a respective FSM. Category data is included in each FSM 
indicating that each acceptable string of that FSM ends with a tagmark and category identifier. The category for a 
composite string as described above may be the category of one of its constituent parts, such as the last string. 
[0084] If feasible, the FSMs for all the categories can be combined into a single FSM using the well known FSM 
union algorithm. The combined FSM is determinized and minimized, all without changing the acceptable set of strings. 
The resulting FSM has one start state. A sequence of states within the FSM may include a cycle, including a pointer 
from a state that leads back to the same state or to another state that occurred earlier in a sequence of states. 
[0085] If it is computationally impractical to combine the FSMs for all categories into a single determinized, minimized 
FSM, one or more additional FSMs can be included, each with a respective start state and with category data so that 
each string it accepts ends with a special category identifier. 

[0086] The resulting combination of one or more FSMs can then be encoded to produce a data structure like that 
described in the FSM encoding application. The data structure can be stored on data storage medium 402 in producing 
software product 400. 

C. Applications 

45 [0087] The invention could be applied to recognition problems in many ways, including incremental lookup techniques 
in which a main recognition routine checks a string by providing each character in sequence, so that if a set of unac- 
ceptable candidate strings share an unacceptable prefix, all of them can be eliminated in a single lookup operation. 
The main recognition routine can also maintain a data structure indicating the locations of previously provided char- 
acters so that backtracking to a previous choice point can be more efficient. 

so [0088] The technique is not limited to recognition of strings of characters, but might also be applied to recognition of 
strings of othertypes of elements where the strings fall into discrete categories. Forexample, in handwriting recognition, 
strings of strokes fall into word categories. Similarly, in speech recognition, strings of phonemes fall into word categories. 
[0089] The technique might also be applied to spelling correction problems. For example, the technique might be 
applied to sequences of characters from an input device such as a keyboard, to assist in correcting mistakes in typing. 

55 

D. Miscellaneous 

[0090] The invention has been described in relation to implementations in which a software product is delivered on 
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a data storage medium. The invention might also be implemented with a software product delivered through transmis- 
sion, such as through telephone lines or over a network. 

[0091] The invention has been described in relation to software implementations, but the invention might be imple- 
mented with specialized hardware. 



Claims 

1 . A product (12; 360; 400) for providing information about strings of characters, comprising: 

a data storage medium (390; 402); and 

string data (12; 392; 404) stored by the data storage medium; the string data indicating a set of two or more 
acceptable strings of characters; 

the string data comprising two or more data units, each of which can be accessed by a processor (14; 370); 
the data units including, for each of a set of two or more acceptable strings of characters, a respective sequence 
of data unite; the string's sequence (30) of data units beginning with a subsequence of character data units 
(32, 34; 202, 220, 222) that the processor can access using character data (1 6) indicating character types of 
the string's characters; the string's sequence of data units including acceptance data indicating that the string 
is one of the acceptable strings; 

characterised in that 

the sequence (30) of data units for at least one string in the set of acceptable strings includes a respective 
ending subsequence (36; 1 00; 202, 204, 206, 208, 21 0, 21 2) of data units that the processor can access after the 
subsequence of character data units, which ending subsequence includes two or more category data units (206, 
210), each category data unit indicating one of a set of two or more categories. 

2. The product of claim 1 in which each data unit (202, 204, 206, 208, 21 0, 21 2, 220, 222) is a byte. 

3. The product of claim 1 or 2 in which the set of acceptable strings includes strings that can be words of a language. 

4. The product of any of the preceding claims in which the said at least one string includes an ending having a 
respective character type; the string's subsequence of character data units including an ending character data unit 
(202) that includes character label information indicating the ending character's character type; the ending char- 
acter data unit including a bit indicating that a string at the end of whose respective subsequence of character data 
units the processor (1 4; 370) can access the ending character data unit is one of the acceptable strings; the string's 
acceptance data including the bit. 

5. The product of any of the preceding claims in which the string ending data of the said at least one string includes 
an acceptance data unit (204) in the string's ending subsequence (202, 204, 206, 208, 210, 212); the acceptance 
data unit having a value indicating that a string at the end of whose respective subsequence of character data 
units the processor (1 4; 370) can access the acceptance data unit is one of the acceptable strings; the first string's 
acceptance data including the acceptance data unit. 

6. The product of claim 5 in which the said at least one string's ending subsequence (202, 204, 206, 208, 21 0, 212) 
includes a set of category data units (206, 210) that the processor (14; 370) can access after accessing the ac- 
ceptance data unit (204). 

7. The product of any preceding claim in which the set of acceptable strings includes a first string including a respective 
ending subsequence of data units including two or more category data units and a second string including a re- 
spective ending subsequence of data units including one or more category data units, one category dataunit that 
is also in the first string's ending subsequence being a shared data unit that is also in the second string's ending 
subsequence. 

8. The product of any of the preceding claims in which the string data (1 2; 392; 404) represent a combined finite state 
machine (FSM). 

9. The product of claim 8 in which one of the set of categories includes infinitely many strings, the words in the set 
of acceptable words that are in the category being expressible as a cyclic FSM; the combined FSM including the 
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cyclic FSM. 

1 0. The product of any of the preceding claims, further comprising a processor (1 4; 370) connected for accessing the 
string data (12; 392; 404) stored by the data storage medium. 

5 

11 . A method of operating a system (10; 360) comprising: 

a processor (14; 370); 

memory (390; 402); the processor being connected for accessing the memory; and 
10 string data (12; 392; 404) stored by the memory; the string data indicating a set of two or more acceptable 

strings of characters; the string data comprising two or more data units, each of which can be accessed by 
the processor; the data units including, for each of a set of two or more acceptable strings of characters, a 
respective sequence of data units; the string's sequence (30) of data units beginning with a subsequence of 
character data units (32, 34; 202, 220, 222) that the processor can access using character data (1 6) indicating 
'5 character types of the string's characters; the string's sequence of data units including string ending data 

indicating that the string is one of the acceptable strings; 
the method comprising: 

(A) operating the processor to access the string data using string character data (1 6) indicating the string's 
20 characters; and 

(B) operating the processor to obtain category data (18) for the string; 

characterised in that 

the sequence (30) of data unite for at least one string in the set of acceptable strings includes a respective 
25 ending subsequence (36; 100; 202, 204, 206, 208, 210, 212) of data units that the processor can access after the 

subsequence of character data units which ending subsequence includes two or more category data units (206, 
210), each category data unit indicating one of a set of two or more categories. 



30 Patentanspriiche 

1. Ein Produkl (12; 360; 400) zur Lieferung von Informationen uber Zeichenketten, das aufweist: 

ein Datenspeichermedium (390; 402); und 
35 auf dem Datenspeichermedium gespeicherte Zeichenkettendaten (12; 392; 404); wobei die Zeichenketten- 

daten einen Satz aus zwei oder mehr akzeptablen Zeichenketten bezeichnen; 

wobei die Zeichenkettendaten zwei Oder mehr Dateneinheiten aufweisen, auf die ein Prozessor (14; 370) 
jeweils zugreifen kann; wobei die Dateneinheiten fur je einen Satz aus zwei oder mehr akzeptablen Zeichen- 
ketten eine entsprechende Sequenz von Dateneinheiten aufweisen; wobei die zeichenkettenbezogene Se- 
40 quenz (30) aus Dateneinheiten mit einer Untersequenz von Zeichendateneinheiten (32, 34; 202, 220, 222) 

beginnt, auf die der Prozessor unter Verwendung von Zeichendaten (16) zugreifen kann, die Zeichentypen 
der Zeichen der Kette bezeichnen; wobei die zeichenkettenbezogene Sequenz aus Dateneinheiten Akzep- 
tanzdaten aufweist, die anzeigen, dass es stch bei der Zeichen kette urn eine der akzeptablen Zeichenketten 
handelt; 

45 

dadurch gekennzeichnet, dass 

die Sequenz (30) aus Dateneinheiten fur mindestens eine Zeichenkette im Satz akzeptabler Zeichenketten eine 
entsprechende Enduntersequenz (36; 100; 202, 204, 206, 208, 210, 212) aus Dateneinheiten aufweist, auf die 
der Prozessor nach der Untersequenz von Zeichendateneinheiten zugreifen kann, wobei die Enduntersequenz 
so zwei oder mehr Kategoriedateneinheiten (206, 21 0) aufweist und jede Kategoriedateneinheit eine aus einem Satz 

von zwei oder mehr Kategorien bezeichnet. 

2. Produkt nach Anspruch 1 , wobei jede Dateneinheit (202, 204, 206, 208, 210, 212, 220, 222) ein Byte ist. 

55 3. Produkt nach Anspruch 1 oder 2, wobei der Satz akzeptabler Zeichenketten Zeichenketten aufweist, die Worter 
einer Sprache sein konnen. 

4. Produkt nach einem der vorhergehenden Anspruche, wobei die mindestens eine Zeichenkette eine Endung ein- 
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schlieBt, die einen entsprechenden Zeichentyp, aufweist; wobei die Untersequenz aus Zeichendaten der Zeichen- 
kette eine Endzeichendateneinheit (202) aufweist, die Zeichenkennungsinformationen enthalt, die den Zeichentyp 
des Endzeichens bezeichnen; wobei die Endzeichendateneinheit ein Bit einschlieftt, das anzeigt, dass eine Zei- 
chenkette, bei der ein Prozessor auf die Endzeichendateneinheit am Ende der jeweiligen Untersequenz aus Zei- 
chendateneinherten der Zeichenkette zugreifen kann, eine der akzeptablen Zeichenketten ist; wobei die Akzep- 
tanzdaten der Zeichenkette dieses Bit einschlieBen. 

5. Produkt nach einem dervorhergehenden Anspruche, wobei dieZeichenkettenendungsdaten dermindestens einen 
Zeichenkette eine Akzeptanzdateneinheit (204) in der Enduntersequenz der Zeichenkette (202, 204, 206, 208, 
210, 212, 220, 222) aufweist; wobei die Akzeptanzdateneinheit einen Wert aufweist, der anzeigt, dass eine Zei- 
chenkette, bei der der Prozessor (14; 370) auf die Akzeptanzdateneinheit am Ende der jeweiligen Untersequenz 
aus Zeichendateneinheiten der Zeichenkette zugreifen kann, eine der akzeptablen Zeichenketten ist; wobei die 
ersten Akzeptanzdaten der Zeichenkette die Akzeptanzdateneinheit einschlieBen. 

6. Produkt nach Anspruch 5, wobei die Enduntersequenz der mindestens einen Zeichenkette (202, 204, 206, 208, 
21 0, 21 2, 220, 222) einen Satz von Kategoriedateneinheiten (206, 21 0) aufweist, auf die der Prozessor (14; 370) 
nach dem Zugriff auf die Akzeptanzdateneinheit (204) zugreifen kann. 

7. Produkt nach einem der vorhergehenden Anspruche, wobei der Satz akzeptabier Zeichenketten eine erste Zei- 
chenkette, die eine entsprechende Enduntersequenz aus Dateneinheiten einschlieBt, die zwei oder mehr Kate- 
goriedateneinheiten einschlieftt, und eine zweite Zeichenkette aufweist, die eine entsprechende Enduntersequenz 
aus Dateneinheiten einschlieBt, die eine oder mehr Kategoriedateneinheiten einschlieBt, wobei eine Kategorieda- 
teneinheit, die auch in der Enduntersequenz der ersten Zeichenkette vorhanden ist, eine gemeinsame Dateneinheit 
darsteilt, die auch in der Enduntersequenz der zweiten Zeichenkette vorhanden ist. 

8. Produkt nach einem der vorhergehenden Anspruche, wobei die Zeichenkettendaten (12; 392; 404) eine kombi- 
nierte Maschine fur den finrten Zustand (FSM) darstellen. 

9. Produkt nach Anspruch 8, wobei eine aus dem Satz der Kategorien unendlich viele Zeichenketten aufweist, wobei 
die Worter im Satz akzeptabier Worter, die unter diese Kategorie fallen, als zyklische FSM ausgedruckt werden 
konnen; wobei die kombinierte FSM die zyklische FSM einschlieBt. 

10. Produkt nach einem der vorhergehenden Anspruche, das weiterhin einen Prozessor (14; 370) aufweist, der zum 
Zweck des Zugriffs auf die Zeichenkettendaten (12; 392; 404) angeschlossen ist, die auf dem Datenspeicherme- 
dium gespeichert sind. 

11. Verfahren zum Betrieb eines Systems (10; 360), das aufweist: 

einen Prozessor (14; 370); 

Speicher (390; 402), wobei der Prozessor zwecks Zugriff auf den Speicher angeschlossen ist; und 
im Speicher gespeicherte Zeichenkettendaten (12; 392; 404); wobei die Zeichenkettendaten einen Satz aus 
zwei oder mehr akzeptablen Zeichenketten bezeichnen; wobei die Zeichenkettendaten zwei oder mehr Da- 
teneinheiten aufweisen, auf die der Prozessor jeweils zugreifen kann; wobei die Dateneinheiten fur je einen 
Satz aus zwei oder mehr akzeptablen Zeichenketten eine entsprechende Sequenz von Dateneinheiten auf- 
weisen; wobei die zeichenkettenbezogene Sequenz (30) aus Dateneinheiten mit einer Untersequenz von Zei- 
chendateneinheiten (32, 34; 202, 220, 222) beginnt, auf die der Prozessor unter Verwendung von Zeichen- 
daten (16) zugreifen kann, die Zeichentypen der Zeichen der Kette bezeichnen; wobei die zeichenkettenbe- 
zogene Sequenz aus Dateneinheiten Zeichenketten en ddaten aufweist, die anzeigen, dass es sich bei der 
Zeichenkette urn eine der akzeptablen Zeichenketten handelt; 
wobei dieses Verfahren aufweist: 

(A) den Betrieb des Prozessors zum Zugriff auf die Zeichenkettendaten unter Nutzung von Zeichenket- 
tendaten (1 6), die die Zeichen der Kette bezeichnen; und 

(B) den Betrieb des Prozessors zur Eriangung von Kategoriedaten (18) fur die Zeichenkette; 
dadurch gekennzeichnet, dass 

die Sequenz (30) aus Dateneinheiten fur mindestens eine Zeichenkette im Satz akzeptabier Zeichenketten eine 
entsprechende Enduntersequenz (36; 100; 202, 204, 206, 208, 210, 212) aus Dateneinheiten aufweist, auf die 
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der Prozessor nach der Untersequenz von Zeichendateneinheiten zugreifen kann, wobei die Enduntersequenz 
zwei oder mehr Kategoriedateneinheiten (206, 21 0) aufweist und jede Kategoriedateneinheit eine aus einem Satz 
von zwei oder mehr Kategorien bezeichnet. 



Revendications 

1 . Produit (12 ; 360 ; 400) de fourniture d'informations concernant des chames de caracteres, comprenant : 

un moyen de mise en memoire de donnees (390 ; 402) ; et 

des donnees de chaTne (12 ; 392 ; 404) mises en memoire par le moyen de mise en memoire de donnees ; 
les donnees de chaTnes indiquant un ensemble de deux ou plusieurs chaTnes de caracteres acceptables ; 

les donnees de chaTnes comprenant deux ou plusieurs unites de donnees, un processeur (14 ; 370) pouvant 
acceder a chacunedecelles-ci ; les unitesde donnees comprenant, pourchacunefaisantpartied'un ensemble 
de deux ou plusieurs chaTnes de caracteres acceptables, une sequence respective d' unites de donnees ; la 
sequence de la chaTne (30) des unites de donnees commencant par une sous sequence d' unites de donnees 
de caracteres (32, 34 ; 202, 220, 222) auxquelles le processeur peut acceder en utilisant des donnees de 
caracteres (16) qui indiquent les types de caracteres des caracteres de la chaTne ; ta sequence de la chaTne 
des unites de donnees comprenant des donnees d'acceptation indiquant que la chaTne est une des chaTnes 
acceptables ; 

caracterise en ce que 

la sequence (30) d' unites de donnees destinee a au moins une chaTne de I'ensemble de chaTnes acceptables 
comprend une sous sequence respective de fin (36 ; 100 ; 202, 204, 206, 208, 210, 212) des unites de donnees 
auxquelles le processeur peut acceder apres ta sous sequence des unites de donnees de caractere, laquelle sous 
sequence de fin comprend deux ou plusieurs unites de donnees de categorie (206, 210), chaque unit 6 de donnees 
de categorie comprenant i'un des ensembles de deux ou plusieurs categories. 

2. Produit selon la revendication 1 , dans lequel chaque unit6 de donnees (202, 204, 206, 208, 210, 212, 220, 222) 
est un multiplet. 

3. Produit selon la revendication 1 ou 2, dans lequel i'ensemble de chaTnes acceptables comprend des chaTnes qui 
peuvent etre les mots d'un langage. 

4. Produit selon I'une quelconque des revendications precedantes, dans lequel ladite au moins une chaTne comprend 
une fin presentant un type de caractere respectif ; ta sous sequence de la chaTne d'unites de donnees de caractere 
comprenant une unite de donnees de caractere de fin (202) qui comprend une information d'etiquette de caractere 
indiquant le type de caractere du caractere de fin ; I'unite de donnees de caractere de fin comprenant un bit qui 
indique qu'une chaTne a la fin de laquelle la sous sequence respective d'unites de donnees de caractere le pro- 
cesseur (14 ; 370) peut acceder a i'unite de donnees de caractere de fin est I'une des chaTnes acceptables ; les 
donnees d'acceptation de chaTne comprenant le bit. 

5. Produit selon i'une quelconque des revendications precedantes, dans lequel les donnees defin de chaTne de ladite 
au moins une chaTne comprennent une unit 6 de donnees d'acceptation (204) dans la sous sequence de fin de 
chaTne (202, 204, 206, 208, 21 0, 21 2) ; I'unite de donnees d'acceptation ayant une valeur indiquant qu'une chaTne 
a la fin desquelles sous sequences respectives des unites de donnees de caractere le processeur (14 ; 370) peut 

, acceder a I'unite de donnees d'acceptation est I'une des chaTnes acceptables ; les donnees d'acceptation de la 
premiere chaTne comprenant I'unite de donnees d'acceptation. 

6. Produit selon la revendication 5, dans lequel ladite au moins une sous sequence de fin de chaTne (202, 204, 206, 
208, 21 0, 212) comprend un ensemble d'unites de donnees de categorie (206, 21 0) auxquelles le processeur (14 ; 
370) peut acceder apres avoir accede aux unites de donnees d'acceptation (204). 

7. Produit selon i'une quelconque des revendications precedantes, dans lequel I'ensemble de chaTnes acceptables 
comprend une premiere chaTne comprenant une sous sequence de fin correspondante des unites de donnees 
comprenant deux ou plusieurs unites de donnees de categorie et une seconde chaTne comprenant une sous 
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sequence de fin respective des unites de donnees comprenant une ou plusieurs unites de donnees de categorie, 
une unite de donnees de categorie qui se trouve egalement dans la sous sequence de fin de la premiere chame 
etant une unite de donnees partagee qui se trouve egalement dans la seconde sous sequence de fin de chaTne. 

8. Produit selon Tune quelconque des revendications preced antes, dans lequel les donn6es de chaTne (12 ; 392 ; 
404) represented une machine comb i nee a etat fini (FSM). 

9. Produit selon la revendication 8, dans lequel I'un des ensembles de categories comprend des chames infiniment 
nombreuses, les mots qui se trouvent dans I'ensemble des mots acceptables qui sont dans la categorie pouvant 
etre exprimee par une FSM cyclique ; la FSM combinee comprenant la FSM cyclique. 

10. Produit selon Tune quelconque des revendications precedantes, comprenant en outre un processeur (14 ; 370) 
connecte pour acceder aux donnees de chaTne (12 ; 392 ; 404) mises en me mo ire par le moyen de mise en 
memo ire de donnees. 

11 . Precede de fonctionnement d'un systeme (10 ; 360) comprenant : 

un processeur (14 ; 370) ; 

une memoire (390 ; 402) ; le processeur etant connecte pour acceder a la memoire ; et 

des donnees de chaTne (12 ; 392 ; 404) mises en memoire par la memoire ; les donnees de chaTne indiquant 
un ensemble de deux ou plusieurs chaTnes de caracteres acceptables ; les donnees de chaTne comprenant 
deux ou plusieurs unites de donnees, a chacune desquelles le processeur peut acceder ; les unites de donnees 
comprenant, pour chacune faisant partie d'un ensemble de deux ou plusieurs chaTnes de caracteres accep- 
tables, une sequence respective d'unrtes de donnees ; la sequence de la chaTne (30) des unites de donnees 
commencant par une sous sequence d'unrtes de donnees de caractere (32, 34 ; 202, 220, 222) auxqueltes le 
processeur peut accecler en utilisant les donnees de caractere (16) qui indiquent les types de caractere des 
caracteres de la chaTne ; la sequence de la chaTne des unites de donnees comprenant des donnees de fin de 
chaTne indiquant que la chaTne est Tune des chaTnes acceptables ; 

le proced6 comprenant : 

(A) la mise en action du processeur pour accecler aux donnees de chaTne en utilisant les donnees de 
caractere de chaTne (16) qui indiquent les caracteres de chaTne ; et 

(B) la mise en action du processeur pour obtenir les donnees de categorie (18) destinees a la chaTne ; 
caracterise en ce que 

la sequence (30) des unites de donnees destinee a au moins une chaTne dans ('ensemble des chaTnes 
acceptables comprend une sous sequence respective de fin (36 ; 100 ; 202, 204, 206, 208, 210, 212) des unites 
de donn6es auxquelles le processeur peut acceder apres la sous sequence des unites de donnees de caractere, 
laquelle sous sequence de fin comprend deux ou plusieurs unites de donnees de categorie (206, 210), chaque 
unite de donnees de categorie indiquant Tun des ensembles de deux ou plusieurs categories. 
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